Introduction to Data Programming

ggplot2 is a data visualization package for the statistical programming language R. Created by Hadley Wickham in 2005, ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics, a general scheme for data visualization that breaks graphs up into semantically independent components, such as scales and layers, which can be composed in many different ways. This makes ggplot2 very powerful: you are not limited to a set of pre-specified graphics, so you can create new graphics precisely tailored to the problem under analysis.

library(ggplot2)
library(dplyr)  # provides %>%, filter(), group_by() and summarize(), used later in this chapter

In this chapter we will learn how to build the following types of graph with ggplot2:

  • scatterplot
  • line graph
  • bar graph
  • histogram
  • box plot

Scatterplot

Scatter plots are used to display the relationship between two continuous variables. Each axis represents one variable, and each point represents an observation. A scatter plot is often the first plot you draw when exploring a data set.

Let us see how to build a scatterplot.
Suppose you are interested in the relationship between humidity and viscosity in the bands dataset.
The bands dataset provides data about process delays, known as cylinder banding, in rotogravure printing.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + geom_point()

The syntax is quite simple. The function ggplot() initializes the plot with the following parameters:

  • data refers to the data frame to use for the plot, in this case the bands data frame
  • mapping refers to the list of aesthetic mappings (visual properties) to use for the plot, passed as arguments to the aes() function. In this case you map the data to the aesthetics of the plot as follows: the x-axis displays the humidity variable and the y-axis displays the viscosity variable. Note that the variable names must be unquoted.

On its own, ggplot() draws no data: it only creates an empty plot object. You have to add to ggplot() the geometric object (geom) you want to draw. A scatter plot is made of points, so you use the geom_point() function, here without any argument.

Alternatively, the plot can be initialized with ggplot() without any argument, and the same arguments passed to geom_point() instead; in that case they apply only to this geom.

ggplot() + geom_point(data=bands, mapping=aes(x=humidity, y=viscosity))
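The difference between the two forms matters as soon as a plot has more than one layer. The following sketch illustrates it with a small made-up data frame, toy, standing in for the bands data (its name and values are assumptions chosen only for illustration): arguments given to ggplot() are inherited by every layer, while arguments given to a geom stay local to it.

```r
library(ggplot2)

# Made-up stand-in for the bands data (illustrative values only)
toy <- data.frame(humidity  = c(70, 75, 80, 85, 90),
                  viscosity = c(40, 45, 50, 48, 55))

# Data and mapping declared in ggplot(): both layers inherit them
p_global <- ggplot(data = toy, mapping = aes(x = humidity, y = viscosity)) +
  geom_point() +
  geom_line()

# Data and mapping declared in a geom: every layer has to repeat them
p_local <- ggplot() +
  geom_point(data = toy, mapping = aes(x = humidity, y = viscosity)) +
  geom_line(data = toy, mapping = aes(x = humidity, y = viscosity))
```

Both objects should draw the same picture; the first form is simply shorter when all layers share the same data.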

Assigning ggplots to a variable

From a formal point of view, a ggplot is an R object like any other, so you can assign it to a variable.

gp1 <- ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + geom_point() 

And now you can recall the plot named gp1.

gp1

Once you have assigned the scatter plot to gp1, you can add a horizontal line to it:

gp2 <- gp1 + geom_hline(yintercept = 50)
gp2
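Adding a layer with + returns a new plot object and leaves the original untouched, which is what makes this style of incremental building safe. A minimal sketch, again using a made-up data frame named toy (an assumption, not part of the bands data):

```r
library(ggplot2)

# Made-up stand-in for the bands data (illustrative values only)
toy <- data.frame(humidity  = c(70, 75, 80, 85, 90),
                  viscosity = c(40, 45, 50, 48, 55))

base_plot <- ggplot(toy, aes(x = humidity, y = viscosity)) + geom_point()
with_line <- base_plot + geom_hline(yintercept = 50)

length(base_plot$layers)  # 1: the original plot still has only the points
length(with_line$layers)  # 2: points plus the horizontal line
```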

Changing the shape and size of points

To change the appearance of the points, such as their shape or size, set shape or size as a parameter of geom_point().

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(shape=2)

R graphics provides 26 point shapes, numbered from 0 to 25. Shapes from 0 to 14 have just an outline, shapes from 15 to 20 are solid, and shapes from 21 to 25 have both an outline and a fill. The default shape for ggplot2 graphics is 19.
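To see what each shape number looks like, you can draw a small reference chart yourself. The sketch below lays the 26 shapes out on a grid; the data frame shapes and its column names are assumptions made up for this example, and scale_shape_identity() tells ggplot2 to use the numbers in the data directly as shape codes.

```r
library(ggplot2)

# One row per shape code, arranged on a grid of 7 columns
shapes <- data.frame(shape = 0:25,
                     col   = (0:25) %% 7,
                     row   = -((0:25) %/% 7))

shape_chart <- ggplot(shapes, aes(x = col, y = row)) +
  geom_point(aes(shape = shape), size = 5, fill = "red") +  # fill only shows for shapes 21 to 25
  geom_text(aes(label = shape), nudge_y = -0.35, size = 3) + # print the code under each symbol
  scale_shape_identity() +
  theme_void()

shape_chart
```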

Shape can also be a single character, passed as a string.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(shape="$", size=3)

Size is set in the same way, alone or together with shape.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(size=5)

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(shape=3, size=1)

Changing the colour of points

Colour can be set in the same way as shape and size: by setting colour as a geom_point() parameter. Both the UK spelling (colour) and the US spelling (color) can be used.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(colour="red")

Shapes from 21 to 25 allow two colours. In these cases, colour sets the outline and fill sets the internal colour.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) + 
  geom_point(shape=21, colour="red", fill="#FF0000")

When you need to pass a colour to R, you can use a string with the colour name (“red”) or with the hexadecimal code (e.g. “#FF0000”).

The data used in these examples has 540 observations, but the plot seems to have fewer points. This is because many points overlap. Transparency is useful in these cases. The alpha aesthetic sets the transparency level: legal alpha values are any numbers from 0 (transparent) to 1 (opaque).

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity)) +
  geom_point(alpha=0.25)

Since alpha=0.25 (and 0.25 is 1/4), a single point is drawn faintly; as a rule of thumb, about four overlapping points are needed before a location approaches a solid colour.
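How quickly overlapping points build up can be estimated with a little arithmetic. The sketch below assumes standard alpha compositing, in which each new point lets a fraction (1 - alpha) of whatever lies beneath it show through; the helper name stacked_opacity is hypothetical, and the exact rendering depends on the graphics device.

```r
# Combined opacity of n overlapping points that all share the same alpha,
# assuming each point lets (1 - alpha) of the layers below show through
stacked_opacity <- function(n, alpha) {
  1 - (1 - alpha)^n
}

# Opacity reached by 1 to 8 overlapping points with alpha = 0.25
round(stacked_opacity(1:8, alpha = 0.25), 2)
```

Under this assumption four overlapping points reach an opacity of about 0.68, and the value keeps approaching 1 as more points pile up, so dense regions stand out clearly from sparse ones.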

Mapping a third variable to scatter plots

Scatter plots are designed to visualize the relationship between two variables: one mapped to the x-axis and one mapped to the y-axis. Sometimes a third variable should be visualized as well. In these cases, you can map the third variable to another aesthetic: size, shape or colour.

Suppose you’re interested in the relationship between humidity and viscosity according to the presence or absence of banding, recorded in band_type. To do this, map band_type to colour.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, colour=band_type)) +
  geom_point()

Note that mapping occurs within aes(), while setting occurs outside of aes().
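The distinction is easy to get wrong, so here is a small sketch of what happens in each case. It uses the built-in mtcars data rather than the bands data (an assumption made only so that the example is self-contained).

```r
library(ggplot2)

# Setting: colour is a constant given outside aes(), so every point is red
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "red")

# Mapping by mistake: inside aes(), "red" is treated as data, that is, as a
# variable with the single value "red". ggplot2 then picks a colour from its
# default palette and adds a legend for it.
p_map <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = "red"))

# Colours actually used by each plot
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```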

The plot shows the same points as the previous ones, now in different colours, and a legend is added.

Alternatively, you can map band_type to another aesthetic, like shape.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, shape=band_type)) +
  geom_point()

Different shapes are more difficult to read than colours when you have many points, but this solution provides an alternative when you are printing in black and white. You can improve the result by using both aesthetics together.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, colour=band_type, shape=band_type)) +
  geom_point()

You can map band_type to size, but the result argues against this choice.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, size=band_type)) +
  geom_point()

If you’re interested in a continuous variable as the third variable, you can map it to colour or to size. It makes no sense to map a continuous variable to shape.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, colour=ink_pct)) +
  geom_point()

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, size=ink_pct)) +
  geom_point()

Small differences in size and colour are harder to perceive than differences in position, so variables mapped to these aesthetics are read with much lower accuracy than those mapped to the spatial coordinates (x and y).

Mapping four variables to scatter plots

Although the interpretation may be difficult, different variables can be mapped to different aesthetics at the same time. In theory you can map as many variables as there are aesthetics, but mapping more than four variables is not recommended.

This is the result when you are interested in ink percentage and band type at the same time.

ggplot(data=bands, mapping=aes(x=humidity, y=viscosity, size=ink_pct, colour=band_type)) +
  geom_point()

When a variable is mapped to size, mapping another variable to shape is not recommended, because it is difficult to compare the sizes of different shapes.

Line Graph

Line graphs are used to display how one continuous variable, on the y-axis, changes in relation to another continuous variable, on the x-axis. A line graph is similar to a scatter plot, except that the points are ordered along the x-axis and connected by segments. The points themselves may be omitted.

Let us see how to build a line graph.
The ChickWeight data contains the body weight of 50 chicks over time. Suppose you are interested in the growth of the first chick.

ggplot(data=(ChickWeight %>% filter(Chick==1)), mapping=aes(x=Time, y=weight)) + geom_line()

Points can be easily added with geom_point().

ggplot(data=(ChickWeight %>% filter(Chick==1)), mapping=aes(x=Time, y=weight)) + geom_line() + geom_point()

You can change the appearance of the plot by setting an aesthetic to a value. For example, the following plot has a dark blue line with larger, square points.

ggplot(data=(ChickWeight %>% filter(Chick==1)), mapping=aes(x=Time, y=weight)) +
  geom_line(colour="darkblue") + geom_point(shape=15, size=2)

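Available aesthetics are x, y, alpha, colour and size, with the same (intuitive) meaning already seen for geom_point(). Note that from ggplot2 3.4.0 onwards the thickness of lines is set with linewidth rather than size.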

To choose the style of the line you can use linetype.

ggplot(data=(ChickWeight %>% filter(Chick==1)), mapping=aes(x=Time, y=weight)) +
  geom_line(colour="darkblue", linetype=2)

The linetype can be a number (0-6) or a description (like "solid" or "dashed"). Available line types are: 0 "blank", 1 "solid", 2 "dashed", 3 "dotted", 4 "dotdash", 5 "longdash" and 6 "twodash".
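A quick way to compare the line types is to draw one segment for each. In the sketch below the data frame line_types and its column names are assumptions made up for this example; scale_linetype_identity() tells ggplot2 to use the names stored in the data directly as line types.

```r
library(ggplot2)

# The seven standard line types, with their numeric codes
line_types <- data.frame(
  code = 0:6,
  name = c("blank", "solid", "dashed", "dotted", "dotdash", "longdash", "twodash")
)

linetype_chart <- ggplot(line_types) +
  geom_segment(aes(x = 0, xend = 1, y = code, yend = code, linetype = name)) +
  scale_linetype_identity() +
  # Label each line with its code and name
  scale_y_continuous(breaks = line_types$code,
                     labels = paste(line_types$code, line_types$name)) +
  xlab(NULL) + ylab(NULL)

linetype_chart
```

The row for "blank" stays empty, since that line type draws nothing.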

Mapping variables to line graphs

If you are interested in the average growth of chicks for the different diets, you must first summarize the data.

ChickWeightMean <- ChickWeight %>% group_by(Time, Diet) %>% summarize(weight=mean(weight))
ChickWeightMean
## Source: local data frame [48 x 3]
## Groups: Time [?]
## 
##     Time   Diet   weight
##    (dbl) (fctr)    (dbl)
## 1      0      1 41.40000
## 2      0      2 40.70000
## 3      0      3 40.80000
## 4      0      4 41.00000
## 5      2      1 47.25000
## 6      2      2 49.40000
## 7      2      3 50.40000
## 8      2      4 51.80000
## 9      4      1 56.47368
## 10     4      2 59.80000
## ..   ...    ...      ...

If you draw the same plot as above with the new data, you obtain a figure like this one.

ggplot(data=ChickWeightMean, mapping=aes(x=Time, y=weight)) +
  geom_line() + geom_point()

A jagged line appears when there are multiple values at each x and you tell ggplot to connect them. A plot like this one should be taken as a warning that something is wrong. To have a single point for each x you can summarize your data further, or you can draw several lines according to a third variable, simply by adding a new aesthetic.

ggplot(data=ChickWeightMean, mapping=aes(x=Time, y=weight, colour=Diet)) +
  geom_line() + geom_point()
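Mapping a variable to colour works because it also splits the data into groups, one line per group. When you want separate lines without a legend entry for each, the group aesthetic does only the splitting. The following sketch applies it to the raw ChickWeight data, drawing one growth curve per chick; the object names are made up for the example.

```r
library(ggplot2)

# One line per chick: group tells geom_line() which points belong together
p_chicks <- ggplot(ChickWeight, aes(x = Time, y = weight, group = Chick)) +
  geom_line(alpha = 0.4)

# group can be combined with colour: still one line per chick,
# but coloured according to the diet the chick received
p_chicks_diet <- ggplot(ChickWeight,
                        aes(x = Time, y = weight, group = Chick, colour = Diet)) +
  geom_line(alpha = 0.6)

p_chicks_diet
```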

Bar graph

Bar graphs are used to display numeric values for different categories. Although they appear similar to histograms, the two plots are very different. Bar graphs are used for categorical x values, the bars should be spaced, and the width of a bar has no meaning. Histograms are used for continuous x values, the bars (called bins) must not be spaced, and the width of the bins depends on the data.

To avoid confusion between bar graphs and histograms, some authors suggest drawing the bars horizontally.

Let us see how to build a bar graph.
As seen in the previous section, each chick in the ChickWeight dataset receives one of four diets. A graphical summary of how many observations there are for each diet can be obtained as follows.

ggplot(data=ChickWeight, mapping=aes(x=Diet)) + geom_bar()

If you prefer horizontal bars, just flip the plot with coord_flip().

ggplot(data=ChickWeight, mapping=aes(x=Diet)) + geom_bar() + coord_flip()

Another way to distinguish bar graphs from histograms is to use narrower bars, which increases the space between them.

ggplot(data=ChickWeight, mapping=aes(x=Diet)) + geom_bar(width=0.5)

Setting and mapping variables to bar graphs

Colours and the other available aesthetics (linetype and size) can be set or mapped as usual. Remember that colour controls the bar outline, while fill controls the colour inside the bar.

If you do not set four different fill colours, all bars have the same colour, as shown above. You can specify the bar colours in the fill argument of geom_bar():

ggplot(data=ChickWeight, mapping=aes(x=Diet)) +
  geom_bar(fill=c("#74a9cf", "#3690c0", "#0570b0", "#034e7b"))

or map fill to the levels of Diet in the geom_bar() function:

ggplot(data=ChickWeight, mapping=aes(x=Diet)) +
  geom_bar(mapping=aes(fill=Diet))

Summarized data and stats

Sometimes data comes already summarized. For example, you may have the following frequency table without the original data.

ChickWeightFreq <- ChickWeight %>% group_by(Diet) %>% summarize(n=n())
ChickWeightFreq
## Source: local data frame [4 x 2]
## 
##     Diet     n
##   (fctr) (int)
## 1      1   220
## 2      2   120
## 3      3   120
## 4      4   118

You can try to build the previous plot in the same way.

ggplot(data=ChickWeightFreq, mapping=aes(x=Diet)) + geom_bar()

The result is a plot with four bars of height 1. Why? The reason is quite simple. By default, geom_bar() scans the Diet column, counting how many observations have Diet=1, how many have Diet=2 and so on; here each diet appears in exactly one row. In this case, you must tell ggplot that you already have the counts.

ggplot(data=ChickWeightFreq, mapping=aes(x=Diet, y=n)) + geom_bar(stat="identity")

Since you have a y variable containing the frequencies, you tell ggplot which variable contains the counts (y=n in the example) and tell geom_bar() that stat="identity" must be used.

A statistical transformation, or stat, transforms the data, typically by summarizing it in some manner. By default, almost all the geoms seen so far use stat="identity", which does not transform the data. As just seen, geom_bar() by default uses stat="count", which counts the number of cases at each x position. If you do not want to transform your data, stat="identity" must be supplied.
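For pre-summarized data, ggplot2 also offers geom_col(), which behaves like geom_bar(stat="identity"). The sketch below builds the frequency table with base R (so it does not depend on dplyr; the name diet_freq is made up for the example) and checks that counting the raw data and plotting the ready-made counts give the same bar heights.

```r
library(ggplot2)

# Frequency table built with base R: one row per diet, counts in column n
diet_freq <- as.data.frame(table(Diet = ChickWeight$Diet), responseName = "n")

# Raw data: geom_bar() counts the rows itself (stat = "count")
p_count <- ggplot(ChickWeight, aes(x = Diet)) + geom_bar()

# Summarized data: the heights are taken from n as they are
p_identity <- ggplot(diet_freq, aes(x = Diet, y = n)) + geom_bar(stat = "identity")
p_col      <- ggplot(diet_freq, aes(x = Diet, y = n)) + geom_col()

# The three plots share the same bar heights
layer_data(p_count)$count
layer_data(p_identity)$y
layer_data(p_col)$y
```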

Stacked and grouped bar graphs

If you are a quality engineer analysing the bands data, you may be interested in the number of presses of each type, distinguished by cylinder size.

ggplot(data=bands, mapping=aes(x=press_type, fill=cylinder_size)) + geom_bar()

Mapping fill to cylinder_size does the job. Notice the small gray area at the top of the last two bars: this means there are a few cases in which cylinder_size is missing.

Sometimes you may instead be interested in the distribution of cylinder_size within each press_type.

ggplot(data=bands, mapping=aes(x=press_type, fill=cylinder_size)) + geom_bar(position="fill")

If you prefer a bar for each combination of press_type and cylinder_size, you can dodge the bars.

ggplot(data=bands, mapping=aes(x=press_type, fill=cylinder_size)) + geom_bar(position="dodge")
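To make clear what position="fill" displays, the sketch below computes the same shares by hand. It uses the built-in mtcars data instead of the bands data (an assumption, so that the example is self-contained), with cyl playing the role of press_type and gear the role of cylinder_size.

```r
library(ggplot2)

# Treat the two grouping variables as categorical
cars <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

# Each bar is rescaled to height 1, so the segments show proportions
p_fill <- ggplot(cars, aes(x = cyl, fill = gear)) +
  geom_bar(position = "fill")

# The same proportions by hand: share of each gear value within each cyl value
shares <- prop.table(table(cyl = cars$cyl, gear = cars$gear), margin = 1)
round(shares, 2)
```

Each row of shares sums to 1, just as each bar in the plot reaches the top of the panel.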

Histogram

Histograms are used to summarize a continuous variable by dividing it into classes, called bins. The area (and not the height) of each bin is proportional to the frequency of cases in the bin, so strictly speaking the vertical axis represents density rather than frequency. When the bins are of equal width, the height of each rectangle is also proportional to the frequency; this is the case for geom_histogram(), which shows counts on the vertical axis by default. As adjacent bins leave no gaps, the rectangles of a histogram touch each other to indicate that the original variable is continuous.

Let us see how to build a histogram.
Suppose you are a quality engineer and you are interested in the distribution of ink percentage in the bands data.

ggplot(data=bands, mapping=aes(x=ink_pct)) + geom_histogram()

As usual, aesthetics can be set to modify the appearance of the plot.

ggplot(data=bands, mapping=aes(x=ink_pct)) + geom_histogram(fill="#2B4C6F", colour="#3690c0")

By default, the data is grouped into 30 bins. You can modify the number of bins:

  • setting the number of bins:

    ggplot(data=bands, mapping=aes(x=ink_pct)) + geom_histogram(fill="#2B4C6F", colour="#3690c0", bins=6)

  • setting the width of each bin:

    ggplot(data=bands, mapping=aes(x=ink_pct)) + geom_histogram(fill="#2B4C6F", colour="#3690c0", binwidth=7)
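If you want to check which bins ggplot2 actually computed, you can inspect the data behind the layer with layer_data(). The sketch below does this for the weight variable of the built-in ChickWeight data (used here as an assumption in place of the bands data); the object names are made up for the example.

```r
library(ggplot2)

# Fix the number of bins ...
p_bins <- ggplot(ChickWeight, aes(x = weight)) +
  geom_histogram(bins = 10)

# ... or fix the width of each bin (here 25 grams)
p_width <- ggplot(ChickWeight, aes(x = weight)) +
  geom_histogram(binwidth = 25)

# Limits and count of each bin as computed by ggplot2
bins_tbl <- layer_data(p_width)[, c("xmin", "xmax", "count")]
bins_tbl
```

Whatever binning you choose, the counts always add up to the number of observations; only the way they are split between bins changes.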

Mapping variables to histograms and faceting

It may be interesting to analyse the distribution of ink percentage for each level of proof_on_ctd_ink.

ggplot(data=bands, mapping=aes(x=ink_pct)) +
  geom_histogram(mapping=aes(fill=proof_on_ctd_ink))

When the grouping variable is mapped to fill, the two distributions are shown in the same panel, stacked on top of each other. In some cases, as shown above, this solution may work. In other cases, the result may be very difficult to read.

ggplot(data=bands, mapping=aes(x=ink_pct)) +
  geom_histogram(mapping=aes(fill=press_type))

In these cases, four separate histograms may produce a more readable result.

ggplot(data=bands, mapping=aes(x=ink_pct)) +
  geom_histogram(fill="#2B4C6F") +
  facet_grid(. ~ press_type)

facet_grid() produces a separate panel for each level of press_type. It takes a formula of the form rows ~ columns. A dot in the formula indicates that there should be no faceting on that dimension (either rows or columns). The following example shows faceting on rows.

ggplot(data=bands, mapping=aes(x=ink_pct)) +
  geom_histogram(fill="#2B4C6F") +
  facet_grid(type_on_cylinder ~ .)

Faceting on both dimensions:

ggplot(data=bands, mapping=aes(x=ink_pct)) +
  geom_histogram(fill="#2B4C6F") +
  facet_grid(type_on_cylinder ~ press_type)
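The same formula notation can be tried on any data set with a categorical variable. As a self-contained sketch, the following uses the built-in ChickWeight data (an assumption in place of the bands data) to draw one histogram of weight per diet, first with facet_grid() and then with facet_wrap(), a related ggplot2 function that lays the panels out in a wrapped sequence instead of a strict grid.

```r
library(ggplot2)

base_hist <- ggplot(ChickWeight, aes(x = weight)) +
  geom_histogram(fill = "#2B4C6F", bins = 15)

# One column of panels per diet, no faceting on rows
p_grid <- base_hist + facet_grid(. ~ Diet)

# One panel per diet, wrapped into two columns
p_wrap <- base_hist + facet_wrap(~ Diet, ncol = 2)

p_wrap
```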

Box Plot

A box-and-whiskers plot, or box plot, is a convenient way to draw the distribution of data. The box ranges from the first quartile to the third (its length is the inter-quartile range, or IQR), with a line indicating the median (second quartile). The whiskers extend to the lowest datum still within 1.5 IQR of the lower quartile and to the highest datum still within 1.5 IQR of the upper quartile. Any data outside the range of the whiskers are represented by dots. Box plots are very popular among data analysts, but they are not recommended for a wider audience. Box plots can be drawn either horizontally or vertically.
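The quantities described above can be computed by hand, which helps when reading a box plot. The sketch below does so with base R for the weights of the chicks on diet 1 in the built-in ChickWeight data (chosen as an assumption, to keep the example self-contained); all variable names are made up for the example.

```r
# Weights of the chicks that received diet 1
w <- ChickWeight$weight[ChickWeight$Diet == 1]

# Box: first quartile, median and third quartile
q   <- unname(quantile(w, c(0.25, 0.50, 0.75)))
iqr <- q[3] - q[1]

# Whiskers: most extreme data still within 1.5 IQR of the box
lower_whisker <- min(w[w >= q[1] - 1.5 * iqr])
upper_whisker <- max(w[w <= q[3] + 1.5 * iqr])

# Outliers: data beyond the whiskers, drawn as individual dots
outliers <- w[w < lower_whisker | w > upper_whisker]

c(lower = lower_whisker, q1 = q[1], median = q[2], q3 = q[3], upper = upper_whisker)
```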

Let us see how to build a box plot.
Suppose you are interested in the differences in ink percentage according to the type of press in the bands data. You can build four box plots to compare the distributions.

ggplot(data=bands, aes(x=press_type, y=ink_pct)) + 
  geom_boxplot(fill="#3690c0")

Appearance aesthetics work as seen so far: fill controls the box filling, while colour controls the box outline, the whiskers and the outlier points.

ggplot(data=bands, aes(x=press_type, y=ink_pct)) + 
  geom_boxplot(fill="#74a9cf", colour="#034e7b")

There are also a few outlier.* parameters to set the aesthetics of outlier points.

ggplot(data=bands, aes(x=press_type, y=ink_pct)) + 
  geom_boxplot(fill="#74a9cf", colour="#034e7b", outlier.colour="red", outlier.shape=18, outlier.size=3)

Customizing Titles and Axes

Axis titles can also be changed. This is useful when the name of the variable is not clear enough. The xlab() function sets the x-axis title, ylab() sets the y-axis title, while ggtitle() should be used when you want a title for the whole plot. Note the use of the escape code \n to break lines.

ggplot(data=bands, aes(x=press_type, y=ink_pct)) + 
  geom_boxplot(fill="#74a9cf", colour="#034e7b") +
  xlab("Press type") + ylab ("Ink %") + ggtitle("Distribution\n(bands data set)")
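The three labels can also be set in a single call with labs(), another ggplot2 function. A minimal sketch, using the built-in ChickWeight data in place of the bands data (an assumption, so that the example runs on its own):

```r
library(ggplot2)

p <- ggplot(ChickWeight, aes(x = Diet, y = weight)) +
  geom_boxplot(fill = "#74a9cf", colour = "#034e7b")

# One function per label ...
p_fun <- p + xlab("Diet") + ylab("Weight (g)") +
  ggtitle("Chick weight\n(ChickWeight data set)")

# ... or all labels at once
p_labs <- p + labs(x = "Diet", y = "Weight (g)",
                   title = "Chick weight\n(ChickWeight data set)")

p_labs
```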

The theme() function controls many theme settings. Here only a few options regarding axes and titles are shown:

  • axis.ticks.x (or axis.ticks.y) controls the ticks of the x-axis (or y-axis). axis.ticks applies the same settings to both axes. You can suppress ticks with element_blank() or use element_line() to change the appearance of the ticks by setting colour, size or linetype.

    ggplot(data=bands, aes(x="0", y=ink_pct)) + 
      geom_boxplot(fill="#74a9cf", colour="#034e7b") +
      xlab("Press type") + ylab ("Ink %") + ggtitle("Distribution\n(bands data set)") +
      theme(axis.ticks.x = element_line(colour="green", size=4), axis.ticks.y=element_line(colour="red"))

  • axis.text.x, axis.text.y and axis.text control the tick labels of the x-axis, the y-axis or both axes. You can suppress the text with element_blank() or use element_text() to change the appearance of the labels by setting font family, font face, colour, size, angle and other options.

    ggplot(data=bands, aes(x=press_type, y=ink_pct)) + 
      geom_boxplot(fill="#74a9cf", colour="#034e7b") +
      xlab("Press type") + ylab ("Ink %") + ggtitle("Distribution\n(bands data set)") +
      theme(axis.text = element_text(face="bold", colour="#034e7b"))

  • axis.title.x, axis.title.y and axis.title control the title of the x-axis, the y-axis or both axes. You can suppress the title with element_blank() or use element_text() to change the appearance of the title text.

    ggplot(data=bands, aes(x=press_type, y=ink_pct)) + 
      geom_boxplot(fill="#74a9cf", colour="#034e7b") +
      xlab("Press type") + ylab ("Ink %") + ggtitle("Distribution\n(bands data set)") +
      theme(axis.title = element_text(face="italic", colour="#034e7b"))

  • plot.title controls the title of the overall plot. You can use element_text() to change the appearance of the title.

    ggplot(data=bands, aes(x=press_type, y=ink_pct)) + 
      geom_boxplot(fill="#74a9cf", colour="#034e7b") +
      xlab("Press type") + ylab ("Ink %") + ggtitle("Distribution\n(bands data set)") +
      theme(plot.title = element_text(face="bold", size=18))